Abstract. We consider the restless Markov bandit problem, in which
the state of each arm evolves according to a Markov process
independently of the learner's actions. We suggest an algorithm that after T
steps achieves $\tilde{O}(\sqrt{T})$ regret with respect to the best policy that knows
the distributions of all arms. No assumptions on the Markov chains are
made except that they are irreducible. In addition, we show that
index-based policies are necessarily suboptimal for the considered problem.

1 Introduction 

In the bandit problem the learner has to decide at time steps t = 1, 2, ... which
of the finitely many available arms to pull. Each arm produces a reward in a 
stochastic manner. The goal is to maximize the reward accumulated over time. 

Following pp, traditionally it is assumed that the rewards produced by each 
given arm are independent and identically distributed (i.i.d.). If the probability 
distributions of the rewards of each arm are known, the best strategy is to 
only pull the arm with the highest expected reward. Thus, in the i.i.d. bandit 
setting the regret is measured with respect to the best arm. An extension of this 
setting is to assume that the rewards generated by each arm are not i.i.d., but 
are governed by some more complex stochastic process. Markov chains suggest 
themselves as an interesting and non-trivial model. In this setting it is often 
natural to assume that the stochastic process (Markov chain) governing each 
arm does not depend on the actions of the learner. That is, the chain takes 
transitions independently of whether the learner pulls that arm or not (giving 
the name restless bandit to the problem). The latter property makes the problem 
rather challenging: since we are not observing the state of each arm, the problem 
becomes a partially observable Markov decision process (POMDP), rather than 
being a (special case of) a fully observable MDP, as in the traditional i.i.d. 
setting. One of the applications that motivate the restless bandit problem is the 
so-called cognitive radio problem (e.g., [2]): Each arm of the bandit is a radio 
channel that can be busy or available. The learner (an appliance) can only sense 
a certain number of channels (in the basic case only a single one) at a time, 
which is equivalent to pulling an arm. It is natural to assume that whether the 
channel is busy or not at a given time step depends on the past — so a Markov 
chain is the simplest realistic model — but does not depend on which channel 



the appliance is sensing. (See also Example 1 in Section [3] for an illustration of 
a simple instance of this problem.) 

What makes the restless Markov bandit problem particularly interesting is 
that one can do much better than pulling the best arm. This can be seen al- 
ready on simple examples with two-state Markov chains (see Section [3] below). 
Remarkably, this feature is often overlooked, notably by some early work on 
restless bandits, e.g. [3], where the regret is measured with respect to the mean
reward of the best arm. This feature also makes the problem more difficult and 
in some sense more general than the non-stochastic bandit problem, in which the 
regret usually is measured with respect to the best arm in hindsight [I] . Finally, 
it is also this feature that makes the problem principally different from the so- 
called rested bandit problem, in which each Markov chain only takes transitions 
when the corresponding arm is pulled. 

Thus, in the restless Markov bandit problem that we study, the regret should 
be measured not with respect to the best arm, but with respect to the best policy 
knowing the distribution of all arms. To understand what kind of regret bounds 
can be obtained in this setting, it is useful to compare it to the i.i.d. bandit prob- 
lem and to the problem of learning an MDP. In the i.i.d. bandit problem, the 
minimax regret expressed in terms of the horizon T and the number of arms only 
is $O(\sqrt{T})$, cf. [5]. If we allow problem-dependent constants into consideration,
then the regret becomes of order log T but depends also on the gap between the 
expected reward of the best and the second-best arm. In the problem of learning 
to behave optimally in an MDP, nontrivial problem-independent finite-time re- 
gret guarantees (that is, regret depending only on T and the number of states and 
actions) are not possible to achieve. It is possible to obtain $\tilde{O}(\sqrt{T})$ regret bounds
that also depend on the diameter of the MDP [5] or similar related constants, 
such as the span of the optimal bias vector [7] . Regret bounds of order log T are 
only possible if one additionally allows into consideration constants expressed 
in terms of policies, such as the gap between the average reward obtained by 
the best and the second-best policy [6]. The difference between these constants 
and constants such as the diameter of an MDP is that one can try to estimate 
the latter, while estimating the former is at least as difficult as solving the orig- 
inal problem — finding the best policy. Turning to our restless Markov bandit 
problem, so far, to the best of our knowledge no regret bounds are available 
for the general problem. However, several special cases have been considered. 
Specifically, $O(\log T)$ bounds have been obtained in [5] and [5]. While the latter
considers the two-armed restless bandit case, the results of [8] are constrained 
by some ad hoc assumptions on the transition probabilities and on the struc- 
ture of the optimal policy of the problem. Also the dependence of the regret 
bound on the problem parameters is unclear, while computational aspects of the 
algorithm (which alternates exploration and exploitation steps) are neglected. 
Finally, while regret bounds for the Exp3.S algorithm [4] could be applied, these
depend on the "hardness" of the reward sequences, which in the case of reward 
sequences generated by a Markov chain can be arbitrarily high. 



Here we present an algorithm for which we derive $\tilde{O}(\sqrt{T})$ regret bounds,
making no assumptions on the distribution of the Markov chains. The algorithm 
is based on constructing an approximate MDP representation of the POMDP 
problem, and then using a modification of the Ucrl2 algorithm of [5] to learn 
this approximate MDP. In addition to the horizon T and the number of arms 
and states, the regret bound also depends on the diameter and the mixing time 
(which can be eliminated however) of the Markov chains of the arms. If the 
regret has to be expressed only in these terms, then our lower bound shows that 
the dependence on T cannot be significantly improved. 

2 Preliminaries 

Given are K arms, where underlying each arm j there is an irreducible Markov
chain with state space $S_j$ and transition matrix $P_j$. For each state s in $S_j$ there
are mean rewards $r_j(s)$, which we assume to be bounded in [0, 1]. For the time
being, we will assume that the learner knows the number of states for each
arm and that all Markov chains are aperiodic. In Section 7 we discuss periodic
chains, while in Section 8 we indicate how to deal with unknown state spaces.
In any case, the learner knows neither the transition probabilities nor the mean
rewards.

For each time step t = 1, 2, ... the learner chooses one of the arms, observes
the current state s of the chosen arm i and receives a random reward with
mean $r_i(s)$. After this, the state of each arm j changes according to the transition
matrices $P_j$. The learner however is not able to observe the current state of the
individual arms. We are interested in competing with the optimal policy $\pi^*$ which
knows the mean rewards and transition matrices, yet observes as the learner only
the current state of the chosen arm. Thus, we are looking for algorithms which
after any T steps have small regret with respect to $\pi^*$, i.e. minimize

$$\rho^* \cdot T - \sum_{t=1}^{T} r_t,$$

where $r_t$ denotes the (random) reward earned at step t and $\rho^*$ is the average
reward of the optimal policy $\pi^*$. (It will be seen in Section 5 that $\pi^*$ and $\rho^*$ are
indeed well-defined.)

Mixing Times and Diameter. If an arm j is not selected for a large number
of time steps, the distribution over states when selecting j will be close to the
stationary distribution $\mu_j$ of the Markov chain underlying arm j. Let $\mu_s^t$ be the
distribution after t steps when starting in state $s \in S_j$. Then setting

$$d_j(t) := \max_{s \in S_j} \|\mu_s^t - \mu_j\|_1 := \max_{s \in S_j} \sum_{s' \in S_j} |\mu_s^t(s') - \mu_j(s')|,$$

we define the $\epsilon$-mixing time of the Markov chain as

$$T^j_{\mathrm{mix}}(\epsilon) := \min\{ t \in \mathbb{N} \mid d_j(t) \le \epsilon \}.$$

Setting somewhat arbitrarily the mixing time of the chain to $T^j_{\mathrm{mix}} := T^j_{\mathrm{mix}}(1/4)$,
one can show (cf. eq. 4.36 in [10]) that

$$T^j_{\mathrm{mix}}(\epsilon) \le \lceil \log_2 (1/\epsilon) \rceil \cdot T^j_{\mathrm{mix}}. \qquad (1)$$

Finally, let $T_j(s, s')$ be the expected time it takes in arm j to reach s' when
starting in s. We set the diameter of arm j to be $D_j := \max_{s, s' \in S_j} T_j(s, s')$.
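As an illustration of the definitions above, the following minimal sketch computes the $\epsilon$-mixing time of a small chain by brute force. The function names, the example chain, and the choice to pass the stationary distribution as an input are assumptions made for this sketch only; they are not part of the algorithm in this paper.

```python
import math

def step(dist, P):
    """One step of the chain: returns the row vector dist * P."""
    n = len(P)
    return [sum(dist[s] * P[s][s2] for s in range(n)) for s2 in range(n)]

def mixing_time(P, stationary, eps, max_steps=10_000):
    """Smallest t >= 1 with max_s ||mu_s^t - mu||_1 <= eps (hypothetical helper)."""
    n = len(P)
    # One point-mass start distribution per start state s.
    dists = [[1.0 if s2 == s else 0.0 for s2 in range(n)] for s in range(n)]
    for t in range(1, max_steps + 1):
        dists = [step(d, P) for d in dists]
        d_t = max(sum(abs(d[s2] - stationary[s2]) for s2 in range(n)) for d in dists)
        if d_t <= eps:
            return t
    raise ValueError("chain did not mix within max_steps")

# Two-state chain that flips its state with probability 0.1 (illustrative values).
P = [[0.9, 0.1], [0.1, 0.9]]
mu = [0.5, 0.5]
t_quarter = mixing_time(P, mu, 0.25)
t_small = mixing_time(P, mu, 0.01)
# Inequality (1): T_mix(eps) <= ceil(log2(1/eps)) * T_mix(1/4)
print(t_quarter, t_small, math.ceil(math.log2(1 / 0.01)) * t_quarter)
```

For this chain the $L_1$ distance after t steps equals $0.8^t$, so the two mixing times can also be checked by hand.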

3 Examples 

Next we present a few examples that give insight into the nature of the problem 
and the difficulties in finding solutions. In particular, the examples demonstrate 
that (i) the optimal reward can be (much) bigger than the average reward of the 
best arm, (ii) the optimal policy does not maximize the immediate reward, (iii) 
the optimal policy cannot always be expressed in terms of arm indexes. 

Example 1. In this example the average reward of each of the two arms of a
bandit is 1/2, but the reward of the optimal policy is close to 3/4. Consider a
two-armed bandit. Each arm has two possible states, 0 and 1, which are also the
rewards. Underlying each of the two arms is a (two-state) Markov chain with
transition matrix $\begin{pmatrix} 1-\epsilon & \epsilon \\ \epsilon & 1-\epsilon \end{pmatrix}$, where $\epsilon$ is small. Thus, a typical trajectory of
each arm looks like this: 000000000001111111111111111000000000 ... , and the
average reward for each arm is 1/2. It is easy to see that the optimal policy starts
with any arm, and then switches the arm whenever the reward is 0, and otherwise
sticks to the same arm. The average reward is close to 3/4, much larger than
the reward of each arm.
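The claim of Example 1 can be checked numerically. The sketch below does not simulate; it tracks the exact distribution of the pair (state of the arm being pulled, state of the other arm) under the switching policy described above and reads off the long-run probability of reward 1. All names and the iteration count are assumptions of this sketch.

```python
from itertools import product

def flip_prob(old, new, eps):
    """One-step probability that a two-state arm moves from `old` to `new`."""
    return eps if old != new else 1.0 - eps

def average_reward_switch_on_zero(eps, iterations=5000):
    """Long-run average reward of 'stay on reward 1, switch on reward 0'."""
    # Analysis state (x, y): x = state of the pulled arm, y = state of the other arm.
    states = list(product((0, 1), repeat=2))
    dist = {s: 0.25 for s in states}
    for _ in range(iterations):
        new = {s: 0.0 for s in states}
        for (x, y), mass in dist.items():
            # On reward 0 the policy switches, so the two arms swap roles.
            cur, other = (x, y) if x == 1 else (y, x)
            for (x2, y2) in states:
                new[(x2, y2)] += mass * flip_prob(cur, x2, eps) * flip_prob(other, y2, eps)
        dist = new
    # The reward equals the state of the pulled arm.
    return sum(mass for (x, _), mass in dist.items() if x == 1)

print(average_reward_switch_on_zero(0.01))
```

Solving the balance equations of this four-state chain by hand gives an average reward of $3/4 - \epsilon/2$, which approaches 3/4 as $\epsilon$ tends to 0.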

This example has a natural interpretation in terms of cognitive radio: two 
radio channels are available, each of which can be either busy (0) or available (1). 
A device can only sense (and use) one channel at a time, and one wants to 
maximize the amount of time the channel it tries to use is available. 

Example 2. Consider the previous example, but with $\epsilon$ close to 1. Thus, a typical
trajectory of each arm is now 01010101001010110..., and the optimal policy
switches arms if the previous reward was 1 and stays otherwise.

Example 3. In this example the optimal policy does not maximize the immediate
reward. Again, consider a two-armed bandit. Arm 1 is as in Example 1, and
arm 2 provides Bernoulli i.i.d. rewards with probability 1/2 of getting reward 1.
The optimal policy (which knows the distributions) will sample arm 1 until it
obtains reward 0, when it switches to arm 2. However, it will sample arm 1 again
after some time t (depending on $\epsilon$), and only switch back to arm 2 when the
reward on arm 1 is 0. Note that whatever t is, the expected reward for choosing
arm 1 will be strictly smaller than 1/2, since the last observed reward was 0 and
the limiting probability of observing reward 1 (when $t \to \infty$) is 1/2. At the same
time, the expected reward of the second arm is always 1/2. Thus, the optimal
policy will sometimes "explore" by pulling the arm with the smaller expected
reward.
Fig. 1. Example 4. Dashed transitions are with probability 1/2, others are deterministic
with probability 1. Numbers are rewards in the respective state.



An intuitively appealing idea is to look for an optimal policy in an index 
form. That is, for each arm the policy maintains an index which is a function of 
time, states, and rewards of this arm only. At each time step, the policy samples 
the arm that has maximal index. This seems promising for at least two reasons: 
First, the distributions of the arms are assumed independent, so it may seem 
reasonable to evaluate them independently as well; second, this works in the 
i.i.d. case (e.g., the Gittins index [TT] or UCB [E| ). This idea also motivates 
the setting when just one out of two arms is Markov and the other is i.i.d., 
see e.g. [5]. Index policies for restless Markov bandits were also studied in [T5] . 
Despite their intuitive appeal, in general, index policies are suboptimal. 

Theorem 1. For each index-based policy $\pi$ there is a restless Markov bandit
problem in which $\pi$ behaves suboptimally.

Proof. Consider the three bandits L (left), C (center), and R (right) in Figure 1,
where C and R start in the 1 reward state. (Arms C and R can easily be made 
aperiodic by adding further sufficiently small transition probabilities.) Assume 
that C has been observed in the \ reward state one step before, while R has been 
observed in the 1 reward state three steps ago. The optimal policy will choose 
arm L which gives reward | with certainty (C gives reward with certainty, 
while R gives reward | with probability ^) and subsequently arms C and R. 
However, if arm C was missing, in the same situation, the optimal policy would 
choose R: Although the immediate expected reward is smaller than when choos- 
ing L, sampling R gives also information about the current state, which can earn 
reward | a step later. Clearly, no index based policy will behave optimally in 
both settings. □ 

4 Main Results 

Theorem 2. Consider a restless bandit with K aperiodic arms having state
spaces $S_j$, diameters $D_j$, and mixing times $T^j_{\mathrm{mix}}$ (j = 1, ..., K). Then with
probability at least $1 - \delta$ the regret of Algorithm 2 (presented in Section 5 below)
after T steps is upper bounded by

$$\mathrm{const} \cdot S \cdot T_{\mathrm{mix}}^{3/2} \cdot \prod_{j=1}^{K} (4 D_j) \cdot \max_i \log(D_i) \cdot \log^2\!\left(\tfrac{T}{\delta}\right) \cdot \sqrt{T},$$

where $S := \sum_{j=1}^{K} |S_j|$ is the total number of states and $T_{\mathrm{mix}} := \max_j T^j_{\mathrm{mix}}$ the
maximal mixing time. Further, the dependence on $T_{\mathrm{mix}}$ can be eliminated to show
that with probability at least $1 - \delta$ the regret is bounded by

$$O\!\left( S \cdot \prod_{j=1}^{K} (4 D_j) \cdot \max_i \log(D_i) \cdot \log^{7/2}\!\left(\tfrac{T}{\delta}\right) \cdot \sqrt{T} \right).$$

Remark 1. For periodic chains the bound of Theorem 2 has worse dependence
on the state space; for details see Remark 5 in Section 7.

Theorem 3. For any algorithm, any K > 1 and any m > 1 there is a K-armed
restless bandit problem with a total number of S := Km states, such that the
regret after T steps is lower bounded by $\Omega(\sqrt{ST})$.

Remark 2. While it is easy to see that lower bounds depend on the total number 
of states over all arms, the dependence on other parameters in our upper bound 
is not clear. For example, intuitively, while in the general MDP case one wrong 
step may cost up to D — the MDP's diameter [6] — steps to compensate for, 
here the Markov chains evolve independently of the learner's actions, and the 
upper bound's dependence on the diameter may be just an artefact of the proof. 

5 Constructing the Algorithm 

MDP Representation. We represent the setting as an MDP by recalling for
each arm the last observed state and the number of time steps which have gone
by since this last observation. Thus, each state of the MDP representation is
of the form $(s_j, n_j)_{j=1}^{K} := (s_1, n_1, s_2, n_2, \ldots, s_K, n_K)$ with $s_j \in S_j$ and $n_j \in \mathbb{N}$,
meaning that each arm j has not been chosen for $n_j$ steps when it was in
state $s_j$. More precisely, $(s_j, n_j)_{j=1}^{K}$ is a state of the considered MDP if and
only if (i) all $n_j$ are distinct and (ii) there is a j with $n_j = 1$ (cf. footnote 3). The action
space of the MDP is {1, 2, ..., K}, and the transition probabilities from a state
$(s_j, n_j)_{j=1}^{K}$ are given by the $n_j$-step transition probabilities $p_j^{(n_j)}(s, s')$ of the
Markov chain underlying the chosen arm j (these are defined by the matrix
power of the single step transition probability matrix, i.e. $P_j^{n_j}$). That is, the
probability for a transition from state $(s_j, n_j)_{j=1}^{K}$ to $(s'_j, n'_j)_{j=1}^{K}$ under action j is
given by $p_j^{(n_j)}(s_j, s'_j)$ iff (i) $n'_j = 1$, (ii) $n'_\ell = n_\ell + 1$ and $s_\ell = s'_\ell$ for all $\ell \ne j$. All
other transition probabilities are 0. Finally, the mean reward for choosing arm j
in state $(s_j, n_j)_{j=1}^{K}$ is given by $\sum_{s \in S_j} p_j^{(n_j)}(s_j, s) \cdot r_j(s)$. This MDP representation
has already been considered in [8].
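To make the transition structure of this representation concrete, the following sketch computes, for a given state and chosen arm, the mean reward and the distribution over successor states. It is only an illustration under assumed data structures: a state is a tuple of pairs (s_j, n_j), and `chains`, `rewards`, `pull`, `mat_mul` and `mat_pow` are hypothetical names introduced here.

```python
def mat_mul(A, B):
    """Product of two square matrices given as lists of rows."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)] for i in range(n)]

def mat_pow(P, n):
    """n-th matrix power of P (n >= 0), i.e. the n-step transition matrix."""
    size = len(P)
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    for _ in range(n):
        result = mat_mul(result, P)
    return result

def pull(state, arm, chains, rewards):
    """Return (mean reward, {successor state: probability}) for pulling `arm`.

    state   : tuple of (s_j, n_j) pairs, one per arm
    chains  : list of single-step transition matrices P_j
    rewards : list of per-state mean rewards r_j
    """
    s, n = state[arm]
    row = mat_pow(chains[arm], n)[s]  # n_j-step probabilities from s_j
    mean_reward = sum(p * rewards[arm][s2] for s2, p in enumerate(row))
    successors = {}
    for s2, p in enumerate(row):
        if p > 0.0:
            # The pulled arm is observed now (n' = 1); all other counters grow by one.
            nxt = tuple((s2, 1) if j == arm else (sj, nj + 1)
                        for j, (sj, nj) in enumerate(state))
            successors[nxt] = p
    return mean_reward, successors

P = [[0.9, 0.1], [0.1, 0.9]]
r, succ = pull(((0, 1), (1, 2)), 1, [P, P], [[0.0, 1.0], [0.0, 1.0]])
print(r, succ)
```

In the usage lines, arm 1 was last seen in state 1 two steps ago, so its successor distribution is the corresponding row of $P^2$.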

Obviously, within T steps any policy can reach only states with nj < T . 
Correspondingly, if we are interested in the regret within T steps, it will be 
sufficient to consider the finite sub-MDP consisting of states with nj < T. We call 
this the T-step representation of the problem, and the regret will be measured 
with respect to the optimal policy in this T-step representation. 

3 Actually, one would need to add for each arm j with $|S_j| > 1$ a special state for not
having sampled j so far. However, for the sake of simplicity we assume that in the
beginning each arm is sampled once. The respective regret is negligible.



Algorithm 1 The colored UCRL2 algorithm

Input: Confidence parameter $\delta > 0$, aggregation parameter $\epsilon > 0$, state space S,
action space A, coloring and translation functions, a bound B on the size of the
support of transition probability distributions.

Initialization: Set t := 1, and observe the initial state $s_1$.

for episodes k = 1, 2, ... do
  Initialize episode k:
    Set the start time of episode k, $t_k := t$. Let $N_k(c)$ be the number of times a
    state-action pair of color c has been visited prior to episode k, and $v_k(c)$ the number
    of times a state-action pair of color c has been visited in episode k. Compute
    estimates $\hat{r}_k(s, a)$ and $\hat{p}_k(s'|s, a)$ for rewards and transition probabilities, using all
    samples from state-action pairs of the same color c(s, a), respectively.

  Compute policy $\tilde{\pi}_k$:
    Let $\mathcal{M}_k$ be the set of plausible MDPs with rewards $\tilde{r}(s, a)$ and transition
    probabilities $\tilde{p}(\cdot|s, a)$ satisfying

      $|\tilde{r}(s,a) - \hat{r}_k(s,a)| \le \epsilon + \sqrt{\frac{7 \log(2 C t_k / \delta)}{2 \max\{1, N_k(c(s,a))\}}}$,   (2)

      $\|\tilde{p}(\cdot|s,a) - \hat{p}_k(\cdot|s,a)\|_1 \le \epsilon + \sqrt{\frac{56 B \log(4 C t_k / \delta)}{\max\{1, N_k(c(s,a))\}}}$,   (3)

    where C is the number of distinct colors. Let $\rho(\pi, M)$ be the average reward of
    a policy $\pi : S \to A$ on an MDP $M \in \mathcal{M}_k$. Choose (e.g. by extended value
    iteration [6]) an optimal policy $\tilde{\pi}_k$ and an optimistic $\tilde{M}_k \in \mathcal{M}_k$ such that

      $\rho(\tilde{\pi}_k, \tilde{M}_k) = \max\{ \rho(\pi, M) \mid \pi : S \to A,\ M \in \mathcal{M}_k \}$.   (4)

  Execute policy $\tilde{\pi}_k$:
    while $v_k(c(s_t, \tilde{\pi}_k(s_t))) < \max\{1, N_k(c(s_t, \tilde{\pi}_k(s_t)))\}$ do
      ▷ Choose action $a_t = \tilde{\pi}_k(s_t)$, obtain reward $r_t$, and observe next state $s_{t+1}$.
      ▷ Set t := t + 1.
    end while
end for



Structure of the MDP Representation. The MDP representation of our
problem has some special structural properties. In particular, rewards and
transition probabilities for choosing arm j only depend on the state of arm j, i.e.
$s_j$ and $n_j$. Moreover, the support for each transition probability distribution is
bounded, and for $n_j \ge T^j_{\mathrm{mix}}(\epsilon)$ the transition probability distribution will be
close to the stationary distribution of arm j. Thus, one could reduce the T-step
representation further by aggregating states (cf. footnote 4) $(s_j, n_j)_{j=1}^{K}$, $(s'_j, n'_j)_{j=1}^{K}$ whenever
$n_j, n'_j \ge T^j_{\mathrm{mix}}(\epsilon)$ and $s_\ell = s'_\ell$, $n_\ell = n'_\ell$ for $\ell \ne j$. The rewards and transition
probability distributions of aggregated states are $\epsilon$-close, so that the error by
aggregation can be bounded by results given in [14]. While this is helpful for
approximating the problem when all parameters are known, it cannot be used
directly when learning, since the observations in the aggregated states do not
correspond to an MDP anymore. Thus, while standard reinforcement learning
algorithms are still applicable, there are no theoretical guarantees for them.

4 Aggregation of states $s_1, \ldots, s_n$ means that these states are replaced by a new
state $s_{\mathrm{agg}}$ inheriting rewards and transition probabilities from an arbitrary $s_i$ (or
averaging over all $s_i$). Transitions to this state are set to $p(s_{\mathrm{agg}}|s, a) := \sum_i p(s_i|s, a)$.

Algorithm 2 The restless bandits algorithm

Input: Confidence parameter $\delta > 0$, the number of states $|S_j|$ and mixing time $T^j_{\mathrm{mix}}$
of each arm j, horizon T.

▷ Choose $\epsilon = 1/\sqrt{T}$ and execute colored UCRL2 (with confidence parameter $\delta$) on
the $\epsilon$-structured MDP described in the "coloring" paragraph at the end of Section 5.

$\epsilon$-structured MDPs and Colored UCRL2. In the following, we exploit the
special structure of the MDP representation. We generalize some of its structural
properties in the following definition.

Definition 1. An $\epsilon$-structured MDP is an MDP with finite state space S,
finite action space A, transition probability distributions $p(\cdot|s, a)$, mean rewards
$r(s, a) \in [0, 1]$, and a coloring function $c : S \times A \to \mathcal{C}$, where $\mathcal{C}$ is a set
of colors. Further, for each two pairs $(s, a), (s', a') \in S \times A$ with $c(s, a) = c(s', a')$
there is a bijective translation function $\phi_{s,a,s',a'} : S \to S$ such that
$\sum_{s''} |p(s''|s, a) - p(\phi_{s,a,s',a'}(s'')|s', a')| < \epsilon$ and $|r(s, a) - r(s', a')| < \epsilon$.

If there are states s, s' in an $\epsilon$-structured MDP such that c(s, a) = c(s', a)
for all actions a and the associated translation function $\phi_{s,a,s',a}$ is the identity,
we may aggregate the states (cf. footnote 4). We call the MDP in which all such
states are aggregated the aggregated $\epsilon$-structured MDP.

For learning in $\epsilon$-structured MDPs we consider a modification of the UCRL2
algorithm. The colored UCRL2 algorithm is shown as Algorithm 1. As the original
UCRL2 algorithm it maintains confidence intervals for rewards and transition
probabilities which define a set of plausible MDPs $\mathcal{M}_k$. In each episode k, the
algorithm chooses an optimistic MDP $\tilde{M}_k \in \mathcal{M}_k$ and an optimal policy which
maximize the average reward, cf. (4). Colored UCRL2 calculates estimates from
all samples of state-action pairs of the same color, and works with respectively
adapted confidence intervals and a corresponding adapted episode termination
criterion. Basically, an episode ends when for some color c the number of visits
in state-action pairs of color c has doubled.
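The bookkeeping behind this termination criterion can be sketched as follows. The class below only pools visit counts and rewards by color and implements the doubling test; it does not compute confidence sets or policies. The class and method names are hypothetical and chosen for this sketch.

```python
from collections import defaultdict

class ColoredCounts:
    """Per-color statistics for an episode-based learner (illustrative sketch)."""

    def __init__(self, color):
        self.color = color                  # color(s, a) -> hashable color
        self.prior = defaultdict(int)       # N_k(c): visits before the current episode
        self.in_episode = defaultdict(int)  # v_k(c): visits in the current episode
        self.reward_sum = defaultdict(float)

    def episode_continues(self, s, a):
        """True while v_k(c) < max{1, N_k(c)} for the color c of (s, a)."""
        c = self.color(s, a)
        return self.in_episode[c] < max(1, self.prior[c])

    def record(self, s, a, reward):
        """Account for one visit of (s, a) with the observed reward."""
        c = self.color(s, a)
        self.in_episode[c] += 1
        self.reward_sum[c] += reward

    def start_new_episode(self):
        """Move the episode counts into the prior counts."""
        for c, v in self.in_episode.items():
            self.prior[c] += v
        self.in_episode.clear()

    def reward_estimate(self, s, a):
        """Average reward over all samples sharing the color of (s, a)."""
        c = self.color(s, a)
        n = self.prior[c] + self.in_episode[c]
        return self.reward_sum[c] / max(1, n)

# With a single color the episode lengths double: 1, 1, 2, 4, ...
counts = ColoredCounts(lambda s, a: a)
lengths = []
for _ in range(4):
    n = 0
    while counts.episode_continues("s", 0):
        counts.record("s", 0, 1.0)
        n += 1
    lengths.append(n)
    counts.start_new_episode()
print(lengths)
```

Pooling by color is what reduces the number of parameters to estimate: state-action pairs of the same color share one counter and one estimate.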

Coloring the T-step representation. Now, we can turn the T-step
representation into an $\epsilon$-structured MDP, assigning the same color to state-action
pairs where the chosen arm is in the same state, that is, $c((s_i, n_i)_{i=1}^{K}, j) = c((s'_i, n'_i)_{i=1}^{K}, j')$
iff $j = j'$, $s_j = s'_j$, and either $n_j = n'_j$ or $n_j, n'_j \ge T^j_{\mathrm{mix}}(\epsilon)$.
The translation functions are chosen accordingly. This $\epsilon$-structured MDP can
be learned with colored UCRL2, see Algorithm 2, our restless bandits algorithm.
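One simple way to realize such a coloring is to cap the counter of the chosen arm at its $\epsilon$-mixing time. The sketch below assumes that states are tuples of pairs (s_j, n_j) and that the mixing times are given as a list; `color`, `num_colors` and `mixing_times` are hypothetical names.

```python
def color(state, arm, mixing_times):
    """Color of the state-action pair (state, arm).

    Two pairs get the same color iff the arm is the same, the last observed
    state of that arm is the same, and the counters either agree or are both
    at least the arm's epsilon-mixing time.
    """
    s, n = state[arm]
    return (arm, s, min(n, mixing_times[arm]))

def num_colors(state_space_sizes, mixing_times):
    """Number of distinct colors: sum over arms of |S_j| * T_mix^j(eps)."""
    return sum(size * t for size, t in zip(state_space_sizes, mixing_times))

mix = [5, 5]
print(color(((0, 7), (1, 1)), 0, mix) == color(((0, 9), (0, 1)), 0, mix))
print(num_colors([2, 2], mix))
```

The count returned by `num_colors` is the quantity that replaces the number of state-action pairs in the regret analysis.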



(The dependence on the horizon T and the mixing times $T^j_{\mathrm{mix}}$ as input
parameters can be eliminated, cf. the proof of Theorem 2 in Section 7.)

6 Regret Bounds for Colored UCRL2

The following is a generalization of the regret bounds for UCRL2 to $\epsilon$-structured
MDPs. The theorem gives improved (with respect to UCRL2) bounds if there
are only a few parameters to estimate in the MDP to learn. Recall that the
diameter of an MDP is the maximal expected transition time between any two
states (choosing an appropriate policy), cf. [6].

Theorem 4. Let M be an $\epsilon$-structured MDP with finite state space S, finite
action space A, transition probability distributions $p(\cdot|s, a)$, mean rewards $r(s, a) \in [0, 1]$,
coloring function c and associated translation functions. Assume the learner
has complete knowledge of state-action pairs $\Psi_K \subseteq S \times A$, while the state-action
pairs in $\Psi_U := S \times A \setminus \Psi_K$ are unknown and have to be learned. However,
the learner knows c and all associated translation functions as well as an upper
bound B on the size of the support of each transition probability distribution
in $\Psi_U$. Then with probability at least $1 - \delta$, after any T steps colored UCRL2 (cf. footnote 5)
gives regret upper bounded by

$$42\, D_\epsilon \sqrt{B\, C_U\, T \log\!\left(\tfrac{T}{\delta}\right)} + \epsilon (D_\epsilon + 2) T,$$

where $C_U$ is the total number of colors for states in $\Psi_U$, and $D_\epsilon$ is the diameter
of the aggregated $\epsilon$-structured MDP.

The proof of this theorem is given in the appendix.

Remark 3. For $\epsilon = 0$, one can also obtain logarithmic bounds analogously to
Theorem 4 of [6]. With no additional information for the learner one gets the
original UCRL2 bounds (with a slightly larger constant), trivially choosing B to
be the number of states and assigning each state-action pair an individual color.

5 For the sake of simplicity the algorithm was given for the case $\Psi_K = \emptyset$. It is obvious
how to extend the algorithm when some parameters are known.

7 Proofs

We start with bounding the diameter in the aggregated $\epsilon$-structured MDP.

Lemma 1. For $\epsilon \le 1/4$, the diameter $D_\epsilon$ in the aggregated $\epsilon$-structured MDP
can be upper bounded by $2 \lceil \log_2(4 \max_j D_j) \rceil \cdot T_{\mathrm{mix}}(\epsilon) \cdot \prod_{j=1}^{K} (4 D_j)$, where we set
$T_{\mathrm{mix}}(\epsilon) := \max_j T^j_{\mathrm{mix}}(\epsilon)$.



Proof. Let $\mu_j$ be the stationary distribution of arm j. It is well-known that
the expected first return time $\tau_j(s)$ in state s satisfies $\mu_j(s) = 1/\tau_j(s)$. Set
$\tau_j := \max_s \tau_j(s)$, and $\tau := \max_j \tau_j$. Then, $\tau_j \le 2 D_j$.

Now consider the following scheme to reach a given state $(s_j, n_j)_{j=1}^{K}$: First,
order the states $(s_j, n_j)$ descendingly with respect to $n_j$. Thus, assume that
$n_{j_1} > n_{j_2} > \cdots > n_{j_K} = 1$. Take $T_{\mathrm{mix}}(\epsilon)$ samples from arm $j_1$. (Then each arm
will be $\epsilon$-close to the stationary distribution, and the probability of reaching the
right state $s_{j_i}$ when sampling arm $j_i$ afterwards is at least $\mu_{j_i}(s_{j_i}) - \epsilon$.) Then
sample each arm $j_i$, i = 2, 3, ..., exactly $n_{j_{i-1}} - n_{j_i}$ times.

We first show the lemma for $\epsilon \le \epsilon_0 := \min_{j,s} \mu_j(s)/2$. As observed before,
for each arm $j_i$ the probability of reaching the right state $s_{j_i}$ is at least
$\mu_{j_i}(s_{j_i}) - \epsilon \ge \mu_{j_i}(s_{j_i})/2$. Consequently, the expected number of restarts of
the scheme necessary to reach a particular state $(s_j, n_j)_{j=1}^{K}$ is upper bounded
by $\prod_{j=1}^{K} 2/\mu_j(s_j)$. As each trial takes at most $2 T_{\mathrm{mix}}(\epsilon)$ steps, recalling that
$1/\mu_j(s) = \tau_j(s) \le 2 D_j$ proves the bound for $\epsilon \le \epsilon_0$.

Now assume that $\epsilon > \epsilon_0$. Since $D_\epsilon \le D_{\epsilon'}$ for $\epsilon > \epsilon'$ we obtain a bound
of $2 T_{\mathrm{mix}}(\epsilon') \prod_{j=1}^{K} (4 D_j)$ with $\epsilon' := \epsilon_0 = 1/(2\tau)$. By (1), we have
$T_{\mathrm{mix}}(\epsilon') \le \lceil \log_2(1/\epsilon') \rceil\, T_{\mathrm{mix}}(1/4) \le \lceil \log_2(4\tau) \rceil\, T_{\mathrm{mix}}(\epsilon)$, which proves the lemma. □

Proof of Theorem 2. Note that in each arm j the support of the transition
probability distribution is upper bounded by $|S_j|$. Hence, Theorem 4 with
$C_U = \sum_{j=1}^{K} |S_j|\, T^j_{\mathrm{mix}}(\epsilon)$ and $B = \max_j |S_j|$ shows that the regret is bounded
by $42\, D_\epsilon \sqrt{\max_i |S_i| \cdot \sum_{j=1}^{K} |S_j| \cdot T^j_{\mathrm{mix}}(\epsilon) \cdot T \log(T/\delta)} + \epsilon (D_\epsilon + 2) T$ with
probability $\ge 1 - \delta$. Since $\epsilon = 1/\sqrt{T}$, this proves the first bound by Lemma 1 and
recalling (1).

If the horizon T is not known, guessing T using the doubling trick (i.e.,
executing the algorithm for $T = 2^i$ with confidence parameter $\delta/2^i$ in rounds
i = 1, 2, ...) achieves the bound given in Theorem 2 with worse constants.

Similarly, if $T_{\mathrm{mix}}$ is unknown, one can perform the algorithm in rounds
i = 1, 2, ... of length $2^i$ with confidence parameter $\delta/2^i$, choosing an increasing
function a(t) to guess an upper bound on $T_{\mathrm{mix}}$ at the beginning t of each
round. This gives a bound of order $a(T)^{3/2} \sqrt{T}$ with a corresponding additive
constant. In particular, choosing $a(t) = \log t$ the regret is bounded by
$O\!\left(S \cdot \prod_{j=1}^{K} (4 D_j) \cdot \max_i \log(D_i) \cdot \log^{7/2}(T/\delta) \cdot \sqrt{T}\right)$ with probability $\ge 1 - \delta$. □
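The doubling scheme used in this proof can be sketched as a schedule of rounds. The helper below is hypothetical and only lists, for a given total number of steps, the round lengths $2^i$ (the last one truncated) together with the confidence parameters $\delta/2^i$; running the learning algorithm inside each round is left out.

```python
def doubling_rounds(horizon, delta):
    """List of (i, steps played in round i, confidence delta / 2**i).

    Round i is planned for 2**i steps; the last round is cut off once
    `horizon` steps have been covered.
    """
    rounds = []
    covered, i = 0, 1
    while covered < horizon:
        length = 2 ** i
        rounds.append((i, min(length, horizon - covered), delta / 2 ** i))
        covered += length
        i += 1
    return rounds

schedule = doubling_rounds(100, 0.1)
print(schedule)
# The failure probabilities of all rounds sum to less than delta.
print(sum(conf for _, _, conf in schedule))
```

Since the confidence parameters form a geometric series, a union bound over all rounds keeps the total failure probability below $\delta$.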

Remark 4. Whereas it is not easy to obtain upper bounds on the mixing time
in general, for reversible Markov chains $T_{\mathrm{mix}}$ can be linearly upper bounded by
the diameter, cf. Lemma 15 in Chapter 4 of [15]. While it is possible to compute
an upper bound on the diameter of a Markov chain from samples of the chain,
we did not succeed in deriving any useful results on the quality of such bounds.

Remark 5. Periodic Markov chains do not converge to a stationary distribution.
However, taking into account the period of the arms, one can generalize our
results to the periodic case. Considering in an m-periodic Markov chain the
m-step transition probabilities given by the matrix $P^m$, one obtains m distinct
aperiodic chains (depending on the initial state) each of which converges to a
stationary distribution with respective mixing times. The maximum over these
mixing times can be considered to be the mixing time of the chain.

Thus, instead of aggregating states $(s_j, n_j)$, $(s'_j, n'_j)$ with $n_j, n'_j \ge T^j_{\mathrm{mix}}(\epsilon)$ as
in the case of aperiodic chains, one aggregates them only if additionally
$n_j \equiv n'_j \bmod m_j$. If the periods $m_j$ are not known to the learner, one can use the least
common multiple of $1, 2, \ldots, |S_j|$ as period. Since by the prime number
theorem the latter is exponential in $|S_j|$, the obtained results for periodic arms show
worse dependence on the number of states. (Concerning the proof of Lemma 1,
the sampling scheme has to be slightly adapted so that one samples in the right
period when trying to reach a particular state.)
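To get a feeling for how quickly this surrogate period grows with the number of states, the least common multiple of 1, 2, ..., n can be computed directly. The helper name `lcm_up_to` is an assumption of this sketch.

```python
from functools import reduce
from math import gcd

def lcm_up_to(n):
    """Least common multiple of 1, 2, ..., n."""
    return reduce(lambda a, b: a * b // gcd(a, b), range(1, n + 1), 1)

for n in (2, 4, 10, 20):
    print(n, lcm_up_to(n))
```

Already for arms with 20 states the surrogate period exceeds 200 million, which illustrates why unknown periods lead to a much worse dependence on the number of states.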

Proof of Theorem 3. Consider K arms all of which are deterministic cycles
of length m and hence m-periodic. Then the learner faces m distinct learning
problems with K arms, each of which can be made to force regret of order
$\Omega(\sqrt{KT/m})$ in the T/m steps the learner deals with the problem [4]. Overall,
this gives the claimed bound of $\Omega(\sqrt{mKT}) = \Omega(\sqrt{ST})$. Adding a sufficiently
small probability (with respect to the horizon T) of staying in some state of each
arm, one obtains the same bounds for aperiodic arms. □

8 Extensions and Outlook 

Unknown state space. If (the size of) the state space of the individual arms is
unknown, some additional exploration of each arm will sooner or later determine
the state space. Thus, we may execute our algorithm on the known state space
where between two episodes we sample each arm until all known states have been
sampled at least once. The additional exploration is upper bounded by $O(\log T)$,
as there are only $O(\log T)$ many episodes, and the time of each exploration phase
can be bounded with known results. That is, the expected number of exploration
steps needed until all states of an arm j have been observed is upper bounded by
$D_j \log(3|S_j|)$ (cf. Theorem 11.2 of [10]), while the deviation from the expectation
can be dealt with by Markov's inequality or results from [16]. That way, one
obtains bounds as in Theorem 2 for the case of unknown state space.

Improving the bounds. All parameters considered, there is still a large gap 
between the lower and the upper bound on the regret. As a first step, it would 
be interesting to find out whether the dependence on the diameter of the arms 
is necessary. Also, the current regret bounds do not make use of the interdepen- 
dency of the transition probabilities in the Markov chains and treat n-step and 
n'-step transition probabilities independently. Finally, a related open question is 
how to obtain estimates and upper bounds on mixing times. 

More general models. After considering bandits with i.i.d. and Markov arms, 
the next natural step is to consider more general time-series distributions. Gen- 
eralizations are not straightforward: already for the case of Markov chains of 
order (or memory) 2 the MDP representation of the problem (Section [5]) breaks 
down, and so the approach taken here cannot be easily extended. Stationary 



ergodic distributions are an interesting more general case, for which the first 
question is whether it is possible to obtain asymptotically sublinear regret. 
A Proof of Theorem 4



Splitting into Episodes. We follow the proof of Theorem 2 in [6]. First, as
shown in Section 4.1 of [6], setting $\Delta_k := \sum_{s,a} v_k(s, a)(\rho^* - r(s, a))$, with
probability at least $1 - \frac{\delta}{12 T^{5/4}}$ the regret after T steps can be upper bounded by



Failing Confidence Intervals. Concerning the regret with respect to the true
MDP M being not contained in the set of plausible MDPs $\mathcal{M}_k$, we cannot use the
same argument (that is, Lemma 17 in Appendix C.1) as in [6], since the random
variables we consider for rewards and transition probabilities are independent,
yet not identically distributed.

Instead, fix a state-action pair (s, a), let S(s, a) be the set of states s' with
$p(s'|s, a) > 0$ and recall that $\hat{r}(s, a)$ and $\hat{p}(\cdot|s, a)$ are the estimates for rewards
and transition probabilities calculated from all samples of state-action pairs of
the same color c(s, a). Now assume that at step t there have been n > 0 samples
of state-action pairs of color c(s, a) and that in the i-th sample action $a_i$ has been
chosen in state $s_i$ and a transition to state $s'_i$ has been observed (i = 1, ..., n).
Then



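$$
\begin{aligned}
\big\|\hat p(\cdot|s,a) - \mathbb{E}[\hat p(\cdot|s,a)]\big\|_1
 &= \sum_{s' \in S(s,a)} \big|\hat p(s'|s,a) - \mathbb{E}[\hat p(s'|s,a)]\big| \\
 &\le \sup_{x \in \{0,1\}^{|S(s,a)|}} \sum_{s' \in S(s,a)} \big(\hat p(s'|s,a) - \mathbb{E}[\hat p(s'|s,a)]\big)\, x(s') \\
 &= \sup_{x \in \{0,1\}^{|S(s,a)|}} \frac{1}{n} \sum_{i=1}^{n} \Big( x\big(\phi_{s_i,a_i,s,a}(s'_i)\big) - \sum_{s'} p(s'|s_i,a_i)\, x\big(\phi_{s_i,a_i,s,a}(s')\big) \Big). \qquad (6)
\end{aligned}
$$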

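For fixed $x \in \{0,1\}^{|S(s,a)|}$,
$X_i := x\big(\phi_{s_i,a_i,s,a}(s'_i)\big) - \sum_{s'} p(s'|s_i,a_i)\, x\big(\phi_{s_i,a_i,s,a}(s')\big)$
is a martingale difference sequence with $|X_i| \le 2$, so that by the Azuma-Hoeffding
inequality (e.g., Lemma 10 in [6]), $\Pr\{\sum_{i=1}^{n} X_i \ge \theta\} \le \exp(-\theta^2/8n)$ and in
particular

$$\Pr\Big\{ \sum_{i=1}^{n} X_i \ge \sqrt{56 B n \log(4 C_U t/\delta)} \Big\} \le \Big(\frac{\delta}{4 C_U t}\Big)^{7B} < \frac{\delta}{2^B\, 20\, t^7 C_U}.$$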



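Recalling that by assumption $|S(s,a)| \le B$, a union bound over all sequences
$x \in \{0,1\}^{|S(s,a)|}$ then shows from (6) that

$$\Pr\Big\{ \big\|\hat p(\cdot|s,a) - \mathbb{E}[\hat p(\cdot|s,a)]\big\|_1 \ge \sqrt{\tfrac{56B}{n} \log(4 C_U t/\delta)} \Big\} \le \frac{\delta}{20\, t^7 C_U}. \qquad (7)$$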
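
The arithmetic behind this step can be checked numerically. The following is a minimal sketch, not part of the original argument: it assumes the Azuma-Hoeffding tail $\exp(-\theta^2/8n)$ stated above and the threshold $\theta = \sqrt{56 B n \log(4 C_U t/\delta)}$, and the function names and the parameter values are purely illustrative.

```python
import math

def azuma_tail(theta, n, c=2.0):
    """Azuma-Hoeffding tail exp(-theta^2 / (2 n c^2)) for increments bounded by c."""
    return math.exp(-theta ** 2 / (2.0 * n * c ** 2))

def check_tail(B, n, t, C_U, delta):
    """Return the tail at the threshold, its closed form, and the per-sequence budget."""
    log_term = math.log(4 * C_U * t / delta)
    theta = math.sqrt(56 * B * n * log_term)          # threshold used in the proof
    tail = azuma_tail(theta, n)                       # exp(-7 B log(4 C_U t / delta))
    closed_form = (delta / (4 * C_U * t)) ** (7 * B)  # the same quantity, rewritten
    budget = delta / (2 ** B * 20 * t ** 7 * C_U)     # bound for one fixed sequence x
    return tail, closed_form, budget

# Illustrative (hypothetical) parameter values.
tail, closed_form, budget = check_tail(B=2, n=50, t=3, C_U=2, delta=0.1)
```

Multiplying `budget` by the $2^B$ possible sequences $x$ gives $\delta/(20\, t^7 C_U)$, which is the shape of the union bound over $x$.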

Concerning the rewards, as in the proof of Lemma 17 in Appendix C.1 of [6],
but now using Hoeffding's inequality for independent and not necessarily
identically distributed random variables, we have that

$$\Pr\Big\{ \big|\hat r(s,a) - \mathbb{E}[\hat r(s,a)]\big| \ge \sqrt{\tfrac{7}{2n} \log(2 C_U t/\delta)} \Big\} \le \frac{\delta}{60\, t^7 C_U}. \qquad (8)$$



A union bound over all $t$ possible values for $n$ and all $C_U$ colors of
state-action pairs in $\Psi_U$ shows that the confidence intervals in (7) and (8)
hold with probability at least $1 - \frac{\delta}{15t^6}$ for the actual counts $N(c(s,a))$ and
all state-action pairs $(s,a)$. (Note that equations (7) and (8) are the same for
state-action pairs of the same color.)
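
The bookkeeping of this union bound can be verified with exact rational arithmetic. The sketch below rests on an assumption: the per-color failure probabilities are taken to be $\delta/(20\, t^7 C_U)$ for the transition estimates and $\delta/(60\, t^7 C_U)$ for the reward estimates, as in the analogous Lemma 17 of [6]. The function name is hypothetical.

```python
from fractions import Fraction

def union_failure_coefficient(t, C_U):
    """Coefficient of delta after a union bound over t values of n and C_U colors.

    Assumes per-color failure probabilities delta/(20 t^7 C_U) for transitions
    and delta/(60 t^7 C_U) for rewards (an assumption taken from Lemma 17 of [6]).
    """
    per_color = Fraction(1, 20 * t ** 7 * C_U) + Fraction(1, 60 * t ** 7 * C_U)
    return t * C_U * per_color

# Under this assumption the total failure probability at step t is delta / (15 t^6).
coefficient = union_failure_coefficient(t=5, C_U=3)
```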

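By linearity of expectation, $\mathbb{E}[\hat r(s,a)]$ can be written as $\frac{1}{n} \sum_{i=1}^{n} r(s_i,a_i)$ for
the sampled state-action pairs $(s_i,a_i)$. Since the $(s_i,a_i)$ are assumed to have the
same color $c(s,a)$, it holds that $|r(s_i,a_i) - r(s,a)| < \varepsilon$ and hence
$|\mathbb{E}[\hat r(s,a)] - r(s,a)| < \varepsilon$. Similarly, $\|\mathbb{E}[\hat p(\cdot|s,a)] - p(\cdot|s,a)\|_1 < \varepsilon$. Together with (7) and (8)
this shows that with probability at least $1 - \frac{\delta}{15t^6}$ for all state-action pairs $(s,a)$

$$\big|\hat r(s,a) - r(s,a)\big| < \varepsilon + \sqrt{\tfrac{7}{2n} \log(2 C_U t/\delta)}, \qquad (9)$$
$$\big\|\hat p(\cdot|s,a) - p(\cdot|s,a)\big\|_1 < \varepsilon + \sqrt{\tfrac{56B}{n} \log(4 C_U t/\delta)}. \qquad (10)$$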
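
To get a feeling for the size of these confidence intervals, the radii can be evaluated directly. This is a minimal sketch under stated assumptions: the radii are taken to be $\varepsilon + \sqrt{7 \log(2 C_U t/\delta)/(2n)}$ for rewards and $\varepsilon + \sqrt{56 B \log(4 C_U t/\delta)/n}$ for transition probabilities, the count $n$ is replaced by $\max\{1, n\}$ as in $N'_k(c)$, and the function names are hypothetical.

```python
import math

def reward_radius(eps, n, C_U, t, delta):
    """Radius of the reward confidence interval (assumed form of (9))."""
    n = max(1, n)  # guard against colors that have not been sampled yet
    return eps + math.sqrt(7.0 * math.log(2 * C_U * t / delta) / (2 * n))

def transition_radius(eps, n, B, C_U, t, delta):
    """Radius of the L1 confidence region for transitions (assumed form of (10))."""
    n = max(1, n)
    return eps + math.sqrt(56.0 * B * math.log(4 * C_U * t / delta) / n)

# Illustrative values: both radii shrink towards eps as the color count n grows.
radii = [transition_radius(0.1, n, B=2, C_U=3, t=10, delta=0.05) for n in (100, 400, 1600)]
```

Both radii never drop below $\varepsilon$, which is the source of the term linear in $T$ in the final bound.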



Thus, the true MDP is contained in the set of plausible MDPs $\mathcal{M}(t)$ at step $t$
with probability at least $1 - \frac{\delta}{15t^6}$, just as in Lemma 17 of [6]. The argument that

$$\sum_{k=1}^{m} \Delta_k \mathbb{1}_{M \notin \mathcal{M}_k} \le \sqrt{T} \qquad (11)$$

with probability at least $1 - \frac{\delta}{12T^{5/4}}$ then can be taken without any changes from
Section 4.2 of [6].



Episodes with $M \in \mathcal{M}_k$. Now assuming that the true MDP $M$ is in $\mathcal{M}_k$, we
first reconsider extended value iteration. In Section 4.3.1 of [6] it is shown that for
the state values $u_i(s)$ in the $i$-th iteration it holds that
$\max_s u_i(s) - \min_s u_i(s) \le D$, where $D$ is the diameter of the MDP. Now we want to
replace $D$ with the diameter $D_\varepsilon$ of the aggregated MDP. For this, first note that
for any two states $s, s'$ which are aggregated we have by definition of the
aggregated MDP that $u_i(s) = u_i(s')$. As it takes at most $D_\varepsilon$ steps on average to
reach any aggregated state, repeating the argument of Section 4.3.1 of [6] shows that

$$\max_s u_i(s) - \min_s u_i(s) \le D_\varepsilon. \qquad (12)$$

Let $\tilde P_k := \big(\tilde p_k(s'|s, \tilde\pi_k(s))\big)_{s,s'}$ be the transition matrix of $\tilde\pi_k$ on $\tilde M_k$, and
$v_k := \big(v_k(s, \tilde\pi_k(s))\big)_s$ the row vector of visit counts in episode $k$ for each state
and the corresponding action chosen by $\tilde\pi_k$. Then as shown in Sect. 4.3.1 of [6]
(here we neglect the error by value iteration, which is considered explicitly there)

$$\Delta_k \le v_k(\tilde P_k - I) w_k + \sum_{s,a} v_k(s,a)\,\big(\tilde r_k(s,a) - r(s,a)\big),$$

where $w_k$ is the normalized state value vector with
$w_k(s) := u(s) - (\min_s u(s) + \max_s u(s))/2$, so that $\|w_k\|_\infty \le D_\varepsilon/2$ by (12). Now for
$(s,a) \in \Psi_K$ we have $\tilde r_k(s,a) = r(s,a)$, while for $(s,a) \in \Psi_U$ the term
$\tilde r_k(s,a) - r(s,a) \le |\tilde r_k(s,a) - \hat r_k(s,a)| + |r(s,a) - \hat r_k(s,a)|$ is bounded by the
confidence intervals for the rewards (cf. (9)), as we assume that $\tilde M_k, M \in \mathcal{M}_k$.
Summarizing state-action pairs of the same color we get



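$$\Delta_k \le v_k(\tilde P_k - I) w_k + 2 \sum_{c \in C(\Psi_U)} v_k(c) \cdot \Big( \varepsilon + \sqrt{\frac{7 \log(2 C_U t_k/\delta)}{2 \max\{1, N_k(c)\}}} \Big),$$

where $C(\Psi_U)$ is the set of colors of state-action pairs in $\Psi_U$. Let $T_k$ be the length
of episode $k$. Then noting that $N'_k(c) := \max\{1, N_k(c)\} \le t_k \le T$ we get

$$\Delta_k \le v_k(\tilde P_k - I) w_k + 2\varepsilon T_k + \sqrt{14 \log(2 C_U T/\delta)} \sum_{c \in C(\Psi_U)} \frac{v_k(c)}{\sqrt{N'_k(c)}}. \qquad (13)$$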

The True Transition Matrix. Let $P_k := \big(p(s'|s, \tilde\pi_k(s))\big)_{s,s'}$ be the transition
matrix of $\tilde\pi_k$ in the true MDP $M$. We split

$$v_k(\tilde P_k - I) w_k = v_k(\tilde P_k - P_k) w_k + v_k(P_k - I) w_k. \qquad (14)$$

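By assumption $\tilde M_k, M \in \mathcal{M}_k$, so that using (10) together with $\|w_k\|_\infty \le D_\varepsilon/2$
the first term in (14) can be bounded by (cf. Section 4.3.2 of [6])

$$
\begin{aligned}
v_k(\tilde P_k - P_k) w_k
 &\le \sum_{s,a} v_k(s,a) \cdot \big\|\tilde p_k(\cdot|s,a) - p(\cdot|s,a)\big\|_1 \cdot \|w_k\|_\infty \\
 &\le 2 \sum_{c \in C(\Psi_U)} v_k(c) \cdot \Big( \varepsilon + \sqrt{\frac{56 B \log(4 C_U T/\delta)}{N'_k(c)}} \Big) \cdot \frac{D_\varepsilon}{2} \\
 &\le \varepsilon D_\varepsilon T_k + D_\varepsilon \sqrt{56 B \log(4 C_U T/\delta)} \sum_{c \in C(\Psi_U)} \frac{v_k(c)}{\sqrt{N'_k(c)}}, \qquad (15)
\end{aligned}
$$

since, as for the rewards, the contribution of state-action pairs in $\Psi_K$ is 0.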

Concerning the second term in (14), as shown in Section 4.3.2 of [6] one has
with probability at least $1 - \frac{\delta}{12T^{5/4}}$

$$\sum_{k=1}^{m} v_k(P_k - I) w_k \mathbb{1}_{M \in \mathcal{M}_k} \le D_\varepsilon \sqrt{\tfrac{5}{2}\, T \log\big(\tfrac{8T}{\delta}\big)} + D_\varepsilon C_U \log_2\big(\tfrac{8T}{C_U}\big), \qquad (16)$$

where $m$ is the number of episodes, and the bound $m \le C_U \log_2(8T/C_U)$ used
to obtain (16) is derived analogously to Appendix C.2 of [6].

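Summing over Episodes with $M \in \mathcal{M}_k$. To conclude, we sum (13) over
all episodes with $M \in \mathcal{M}_k$, using (14), (15), and (16), which yields that with
probability at least $1 - \frac{\delta}{12T^{5/4}}$

$$
\begin{aligned}
\sum_{k=1}^{m} \Delta_k \mathbb{1}_{M \in \mathcal{M}_k}
 \le{}& D_\varepsilon \sqrt{\tfrac{5}{2}\, T \log\big(\tfrac{8T}{\delta}\big)} + D_\varepsilon C_U \log_2\big(\tfrac{8T}{C_U}\big) + \varepsilon (D_\varepsilon + 2) T \\
 &+ \Big( D_\varepsilon \sqrt{56 B \log(4 C_U T/\delta)} + \sqrt{14 \log(2 C_U T/\delta)} \Big) \sum_{k=1}^{m} \sum_{c \in C(\Psi_U)} \frac{v_k(c)}{\sqrt{N'_k(c)}}. \qquad (17)
\end{aligned}
$$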



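As in Sect. 4.3.3 and Appendix C.3 of [6], one obtains
$\sum_{c \in C(\Psi_U)} \sum_{k} \frac{v_k(c)}{\sqrt{N'_k(c)}} \le (\sqrt{2} + 1) \sqrt{C_U T}$. Thus, evaluating (5) by summing $\Delta_k$
over all episodes, by (11) and (17) the regret is upper bounded with probability
$\ge 1 - \frac{\delta}{4T^{5/4}}$ by

$$
\begin{aligned}
&\sum_{k=1}^{m} \Delta_k \mathbb{1}_{M \notin \mathcal{M}_k} + \sum_{k=1}^{m} \Delta_k \mathbb{1}_{M \in \mathcal{M}_k} + \sqrt{\tfrac{5}{8}\, T \log\big(\tfrac{8T}{\delta}\big)} \\
&\quad \le \sqrt{\tfrac{5}{8}\, T \log\big(\tfrac{8T}{\delta}\big)} + \sqrt{T} + D_\varepsilon \sqrt{\tfrac{5}{2}\, T \log\big(\tfrac{8T}{\delta}\big)} + D_\varepsilon C_U \log_2\big(\tfrac{8T}{C_U}\big) \\
&\qquad + \varepsilon (D_\varepsilon + 2) T + 3 (\sqrt{2} + 1) D_\varepsilon \sqrt{14 B C_U T \log\big(\tfrac{4 C_U T}{\delta}\big)}.
\end{aligned}
$$

Further simplifications as in Appendix C.4 of [6] finish the proof. □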
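
The counting step borrowed from Appendix C.3 of [6] rests on the following fact: for any sequence with $0 \le z_k \le Z_{k-1} := \max\{1, \sum_{i<k} z_i\}$ one has $\sum_k z_k/\sqrt{Z_{k-1}} \le (\sqrt{2}+1)\sqrt{Z_n}$. The sketch below checks this numerically for a single color on random sequences. It assumes that the visits $v_k(c)$ within an episode never exceed $N'_k(c)$, as guaranteed by a doubling criterion for ending episodes; the function names and the random sequences are illustrative only.

```python
import math
import random

def simulate_color(num_episodes, rng):
    """Draw visit counts v_k <= N'_k for one color; return (sum of v_k / sqrt(N'_k), total)."""
    visits_before = 0  # N_k(c): visits to the color before episode k
    lhs = 0.0
    for _ in range(num_episodes):
        n_prime = max(1, visits_before)  # N'_k(c) = max{1, N_k(c)}
        v = rng.randint(0, n_prime)      # assumed doubling criterion: v_k(c) <= N'_k(c)
        lhs += v / math.sqrt(n_prime)
        visits_before += v
    return lhs, max(1, visits_before)

def bound_holds(lhs, total, tol=1e-9):
    """Check lhs <= (sqrt(2) + 1) * sqrt(total), up to floating-point tolerance."""
    return lhs <= (math.sqrt(2) + 1) * math.sqrt(total) + tol

rng = random.Random(0)  # fixed seed so that the check is reproducible
all_hold = all(
    bound_holds(*simulate_color(rng.randint(1, 40), rng)) for _ in range(1000)
)
```

Summing the per-color bound over the $C_U$ colors and applying Jensen's inequality to $\sum_c \sqrt{N(c)}$ with $\sum_c N(c) \le T$ gives the factor $\sqrt{C_U T}$ used above.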



